Data Replication

An Elasticsearch cluster holds multiple indices. Each index is divided into one or more primary shards, and each primary shard can have zero or more replica shards, which are copies kept on other nodes.

Replication provides high availability and scales search throughput, because read operations such as searches can run on replicas as well as on primaries.

YELLOW status for an index means that all primary shards are allocated (on the Elasticsearch data nodes) but at least one replica has not been allocated (as opposed to GREEN, where every primary and every replica is allocated). In other words, the replica rules aren't satisfied, and there is a risk of data loss if a node holding a primary fails.
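The status rule can be sketched as a small function. This is only an illustration of the rule, not Elasticsearch's own code: the `ShardCopy` type and `index_health` function are hypothetical names, and the RED case (at least one primary unallocated) is added for completeness.

```python
from typing import List, NamedTuple


class ShardCopy(NamedTuple):
    """One copy of a shard (hypothetical type, for illustration only)."""
    shard_id: int
    primary: bool   # True for the primary, False for a replica
    assigned: bool  # True if the copy is allocated to a data node


def index_health(copies: List[ShardCopy]) -> str:
    """Derive an index's health colour from its shard copies."""
    # RED: at least one primary is not allocated.
    if any(c.primary and not c.assigned for c in copies):
        return "red"
    # YELLOW: all primaries are allocated, but some replica is not.
    if any(not c.primary and not c.assigned for c in copies):
        return "yellow"
    # GREEN: every primary and every replica is allocated.
    return "green"


copies = [
    ShardCopy(shard_id=0, primary=True, assigned=True),
    ShardCopy(shard_id=0, primary=False, assigned=False),
]
print(index_health(copies))
```

Here the only replica of shard 0 is unassigned, so the sketch reports the index as yellow.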

Elasticsearch exposes many allocation parameters (shards per AZ, rack awareness) that together amount to a scheduling heuristic, and under that heuristic some replicas may end up with no suitable host.
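One of the simplest such constraints is that Elasticsearch never places two copies of the same shard on the same node. The sketch below models only that rule and ignores awareness and every other allocation setting. The function name is hypothetical, while `number_of_replicas` mirrors the index setting of the same name.

```python
def unassigned_copies_per_shard(data_nodes: int, number_of_replicas: int) -> int:
    """Count the copies of one shard that cannot be placed.

    Assumes the only constraint is "one copy of a given shard per node";
    real allocation takes many more settings into account.
    """
    copies = 1 + number_of_replicas  # the primary plus its replicas
    return max(0, copies - data_nodes)


# A single-node cluster with one replica per shard: the replica has
# nowhere to go, so the index stays YELLOW.
print(unassigned_copies_per_shard(data_nodes=1, number_of_replicas=1))
```

Under this model, adding a second data node or setting the replica count to zero would let every copy be allocated.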

Split Brains

An Elasticsearch cluster may have multiple data nodes, so it is important to consider network failures that partition the cluster.

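Split brain occurs when a distributed system is split into multiple partitions that keep functioning independently, with more than one node believing it is the master. This must be avoided, because the partitions can then accept conflicting writes and diverge.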
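The usual defense is to let a partition elect a master only if it can see a majority (quorum) of the master-eligible nodes, so that at most one partition can ever have a master. In Elasticsearch versions before 7.0 this threshold was set by hand through `discovery.zen.minimum_master_nodes`; later versions manage it automatically. The following is a minimal sketch of the arithmetic, with hypothetical function names:

```python
def quorum(master_eligible_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return master_eligible_nodes // 2 + 1


def can_elect_master(partition_size: int, master_eligible_nodes: int) -> bool:
    """A partition may elect a master only if it holds a quorum."""
    return partition_size >= quorum(master_eligible_nodes)


# Three master-eligible nodes split 2 / 1 by a network partition:
print(can_elect_master(2, 3))  # the larger side holds a quorum
print(can_elect_master(1, 3))  # the smaller side does not

# Two master-eligible nodes split 1 / 1: neither side holds a quorum,
# which is why an odd number of master-eligible nodes is commonly used.
print(can_elect_master(1, 2))
```

Because two disjoint partitions cannot both contain a strict majority, this rule rules out two masters at once, at the cost of having no master at all when no partition reaches the quorum.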